Ocean Springs
Embedding And Clustering Your Data Can Improve Contrastive Pretraining
Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.
- North America > United States > Montana > Flathead County > Kalispell (0.14)
- North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
- North America > Canada (0.04)
- (18 more...)
- Leisure & Entertainment (1.00)
- Law (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- (6 more...)
The Cure for Cancer Is Data--Mountains of Data
A few years ago Eric Schadt met a woman who had cancer. It was an aggressive form of colon cancer that had come on quickly and metastasized to her liver. She was a young war widow from Mississippi, the mother of two girls she was raising alone, and she had only the health care that her husband's death benefits afforded her--an overburdened oncologist at a military hospital, the lowest rung on the health care ladder. To walk into such a facility with stage 4 metastatic disease is to walk back in time to the world of the unmapped human genome, when "colon cancer" was understood to have a single cause instead of millions of causes resulting in unique variations, when treatment was the same bag of poison, whether you were in Ocean Springs, Mississippi, or Timbuktu. A time without big data, machine learning, or hope.
- North America > United States > Mississippi > Jackson County > Ocean Springs (0.24)
- Africa > Mali > Tombouctou > Timbuktu (0.24)
- North America > United States > New York (0.05)
- Asia > China > Beijing > Beijing (0.04)